March 18, 2019

Testing Causal Theories

Plan for Today:

(1) Correlation

  • correlation
    • technical details
  • random association
    • \(p\) values

Correlation

correlation:

degree of association or relationship between the observed values taken by two variables (\(X\) and \(Y\))

  • Many different ways of doing this (compare group means, regression) are all fundamentally about correlation.
  • correlations have a direction:
    • positive: implies that as \(X\) increases, \(Y\) increases
    • negative: \(X\) increases, \(Y\) decreases
  • correlations have strength (this has nothing to do with size of effect):
    • strong: \(X\) and \(Y\) almost always move together
    • weak: \(X\) and \(Y\) do not move together very much
  • There is also a technical definition of correlation (later)

Correlation

What is it?

(Pearson) correlation: also has a specific mathematical definition (you don't need to know it):

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}\]

This captures the extent to which deviations from the mean of \(X\) move with deviations from the mean of \(Y\).
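
As a rough illustration, the formula can be computed step by step in a few lines of Python. The data (`hours`, `score`) are made up for this sketch and do not come from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from the definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the two means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)) * math.sqrt(
        sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 3))
```

Because the scores rise almost in step with the hours, the result is close to \(1\).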

Correlation

What is it?

mathematically: correlation is the degree of linear association between \(X\) and \(Y\)

  • Takes values between \(-1\) and \(1\)
  • Values close to \(1\) or \(-1\) suggest high degree of linear association
  • Values close to \(0\) suggest low degree of linear association
  • Value of correlation does not tell us how much \(Y\) changes with \(X\)

Correlation

What is it?

negative correlation: (correlation \(< 0\)) values of \(X\) and \(Y\) move in opposite directions:

  • higher values of \(X\) appear with lower values of \(Y\)
  • lower values of \(X\) appear with higher values of \(Y\)

positive correlation: (correlation \(> 0\)) values of \(X\) and \(Y\) move in same direction:

  • higher values of \(X\) appear with higher values of \(Y\)
  • lower values of \(X\) appear with lower values of \(Y\)

Correlation

  • It is possible to see perfect correlation but small change in \(Y\) across \(X\)

  • It is possible to see low correlation but large change in \(Y\) across \(X\)

  • It is possible to see perfect nonlinear relationship between \(X\) and \(Y\) with \(0\) correlation
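
The three cases above can be checked with small made-up datasets. This is only a sketch: the numbers are invented, and the slope is taken to be the least-squares slope, which is an assumption about how the line is drawn.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

def slope(xs, ys):
    """Least-squares slope of the line predicting Y from X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return num / sum((x - x_bar) ** 2 for x in xs)

x = [1, 2, 3, 4, 5, 6]

# Case 1: perfect correlation, but Y barely changes with X.
flat = [0.001 * v for v in x]
r_flat, b_flat = pearson_r(x, flat), slope(x, flat)

# Case 2: Y rises 10 units per unit of X on average, but large noise
# (hand-picked for this example) makes the correlation weak.
noise = [100, -200, 100, 100, -200, 100]
steep = [10 * v + e for v, e in zip(x, noise)]
r_steep, b_steep = pearson_r(x, steep), slope(x, steep)

# Case 3: Y is exactly X squared (a perfect nonlinear relationship),
# yet the linear correlation is 0.
x_sym = [-2, -1, 0, 1, 2]
curve = [v ** 2 for v in x_sym]
r_curve = pearson_r(x_sym, curve)

print(f"flat:  r = {r_flat:.2f}, slope = {b_flat:.3f}")
print(f"steep: r = {r_steep:.2f}, slope = {b_steep:.1f}")
print(f"curve: r = {r_curve:.2f}")
```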

Correlation:

weak correlation: values for \(X\) and \(Y\) do not cluster along line

strong correlation: values for \(X\) and \(Y\) cluster strongly along a line

strength of correlation does not fully determine the slope of line describing \(X,Y\) relationship

effect size: this is the slope of the line describing the \(X,Y\) relationship. The larger the effect, the steeper the slope
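
One way to see why strength and slope are separate things: if the line is fitted by least squares (an assumption here, since the slides do not say how the line is drawn), its slope \(b\) is related to the correlation \(r\) by

\[b = r \cdot \frac{s_Y}{s_X}\]

where \(s_X\) and \(s_Y\) are the standard deviations of \(X\) and \(Y\). The same \(r\) can therefore go with a steep or a shallow slope, depending on how spread out \(Y\) is relative to \(X\).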

Random Association

Correlation: Random association

How do we know a correlation is systematic?

  • How do we know that it is not simply a pattern by random chance?
  • Apparent patterns can be produced by pure randomness

Correlation: Random association

If you look at enough possible sets of variables, you might find a strong correlation

  • But it could have happened by chance!
  • So a correlation might not be "real"
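
A small simulation makes the point concrete. In this sketch, every variable is pure random noise (independent normal draws, an assumption chosen for illustration), so any correlation found is due to chance alone. The number of cases and candidates are arbitrary.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

rng = random.Random(0)  # fixed seed so the run is reproducible
n_cases = 10            # cases per variable
n_candidates = 1000     # unrelated variables we search through

# One "outcome" variable made of pure noise.
target = [rng.gauss(0, 1) for _ in range(n_cases)]

# Search many unrelated noise variables and keep the strongest correlation.
best = 0.0
for _ in range(n_candidates):
    candidate = [rng.gauss(0, 1) for _ in range(n_cases)]
    r = pearson_r(target, candidate)
    if abs(r) > abs(best):
        best = r

print(f"strongest correlation found among {n_candidates} unrelated variables: {best:.2f}")
```

With only 10 cases and 1000 variables to search through, the strongest correlation found is typically large, even though none of the variables is related to the target.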

[Arbitrary Correlations](http://www.tylervigen.com/spurious-correlations)

Random association: Statistics

Field of statistics investigates properties of chance events (stochastic processes):

  • Probability theory tells us how likely events are to happen, given chance
  • Can tell us how likely correlation of some value is to happen by chance

Random association: Statistics

How?

  1. Compute correlation of \(X\) and \(Y\)
  2. How many cases do we have?
    • Patterns with many cases less likely to occur at random
  3. Assign a probability that the correlation we see would have happened by chance

ASSUMPTION

  • We need to assume we know the chance process generating this correlation
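
One possible way to carry out these steps is a permutation test, sketched below. The chance process assumed here is that the \(Y\) values are randomly shuffled across cases, which breaks any real link with \(X\). This is one choice among several and not necessarily the procedure used in the lecture. The data and the function name `permutation_p_value` are made up for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Share of random shufflings of Y whose correlation with X is at
    least as strong (in absolute value) as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))      # step 1: compute the correlation
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)              # assumed chance process
        # Small tolerance guards against floating-point rounding.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles               # step 3: probability under chance

# Hypothetical data with 8 cases (step 2: the number of cases matters).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
p = permutation_p_value(x, y)
print(f"r = {pearson_r(x, y):.3f}, permutation p = {p:.4f}")
```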

Random association: Statistics

Same Correlation, More cases
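
A simulation can illustrate why the same correlation is more convincing with more cases. This sketch assumes the chance process is independent normal draws for \(X\) and \(Y\) (so the true correlation is \(0\)) and estimates how often a correlation of at least \(0.5\) in absolute value shows up by chance. The threshold and sample sizes are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

def chance_of_r_at_least(r_threshold, n_cases, n_sims=2000, seed=0):
    """Estimate how often unrelated X and Y show |r| >= r_threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0, 1) for _ in range(n_cases)]
        ys = [rng.gauss(0, 1) for _ in range(n_cases)]  # independent of xs
        if abs(pearson_r(xs, ys)) >= r_threshold:
            hits += 1
    return hits / n_sims

p_small = chance_of_r_at_least(0.5, n_cases=10)
p_large = chance_of_r_at_least(0.5, n_cases=50)
print(f"chance of |r| >= 0.5 with 10 cases: {p_small:.3f}")
print(f"chance of |r| >= 0.5 with 50 cases: {p_large:.3f}")
```

The estimated chance is much smaller with 50 cases than with 10.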

Random association: Statistics

statistical significance:

An indication of how likely it is that the correlation we observe could have happened purely by chance.

higher degree of statistical significance indicates correlation is less likely to have happened by chance

Random association: Statistics

\(p\) value:

  • A numerical measure of statistical significance. Puts a number on how likely a correlation at least as strong as the one observed would be to occur by chance, assuming we know the chance procedure and the truth is a \(0\) correlation.

  • It is a probability, so is between \(0\) and \(1\).

  • Lower \(p\)-values indicate greater statistical significance

\(p < 0.05\) often used as threshold for "significant" result.

  • but it is not a magic number
  • Can observe \(p < 0.05\) by chance: when the true correlation is \(0\), this happens about \(\frac{1}{20}\) of the time
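
The \(\frac{1}{20}\) figure can be checked by simulation. This sketch assumes 20 cases per sample, independent normal draws (true correlation \(0\)), and a cutoff of \(|r| \geq 0.444\), which is approximately the two-sided \(p < 0.05\) threshold for 20 cases under the usual \(t\)-test for a correlation. All of these are illustrative assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den

N_CASES = 20
R_CUTOFF = 0.444   # approx. |r| needed for p < 0.05 (two-sided) with 20 cases
N_SIMS = 2000

rng = random.Random(0)
significant = 0
for _ in range(N_SIMS):
    xs = [rng.gauss(0, 1) for _ in range(N_CASES)]
    ys = [rng.gauss(0, 1) for _ in range(N_CASES)]  # unrelated to xs
    if abs(pearson_r(xs, ys)) >= R_CUTOFF:
        significant += 1

rate = significant / N_SIMS
print(f"share of 'significant' results when the truth is 0: {rate:.3f}")
```

The share should come out near \(0.05\), that is, roughly 1 in 20.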

Random association: Statistics

\(p\) value:

Be wary of "\(p\)-hacking"

  • \(p\) values become meaningless if we look at many associations, then only report the ones that are "significant".

Why?

  • low \(p\)-values occur by chance when we look at lots of associations
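
A short calculation shows how quickly this happens. The sketch assumes the tests are independent and that every true correlation is \(0\). The function name is made up for illustration.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests gives
    p < alpha when every true correlation is 0."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20, 100):
    print(k, round(chance_of_any_false_positive(k), 3))
```

Under these assumptions, looking at 20 unrelated associations gives roughly a 64% chance of finding at least one "significant" result.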

Significant?

What else do you want to know?

We'd want to know this

Random association

Recap:

  1. Correlations can appear by chance
  2. We can assess probability of chance correlation if we know:
    • strength of correlation (close to \(1,-1\))
    • size of the sample (\(N\))
    • underlying chance process
  3. \(p\)-values:
    • Obtained using mathematical formulas

Random association

Recap:

| Statistical Significance | \(p\)-value | By Chance? | Why? | "Real"? |
|---|---|---|---|---|
| Low | High (\(p > 0.05\)) | Likely | small \(N\), weak correlation | Probably not |
| High | Low (\(p < 0.05\)) | Unlikely | large \(N\), strong correlation | Probably |